Indexing Metric Spaces for Exact Similarity Search
With the continued digitalization of societal processes, we are seeing an
explosion in available data. This is referred to as big data. In a research
setting, three aspects of the data are often viewed as the main sources of
challenges when attempting to enable value creation from big data: volume,
velocity and variety. Many studies address volume or velocity, while far fewer
studies address variety. Metric space is ideal for addressing variety
because it can accommodate any type of data as long as its associated distance
notion satisfies the triangle inequality. To accelerate search in metric space,
a collection of indexing techniques for metric data have been proposed.
However, existing surveys each offer only narrow coverage, and no
comprehensive empirical study of those techniques exists. We offer a survey of
all the existing metric indexes that can support exact similarity search, by i)
summarizing all the existing partitioning, pruning and validation techniques
used for metric indexes, ii) providing the time and storage complexity analysis
of index construction, and iii) reporting on a comprehensive empirical
comparison of their similarity query processing performance. Empirical
comparisons are used to evaluate index performance during search because
complexity analysis alone reveals few differences in similarity query
processing, and because query performance depends on pruning and validation
abilities, which in turn depend on the data distribution. This article aims to
reveal the strengths and weaknesses of different indexing techniques in order to
offer guidance on selecting an appropriate indexing technique for a given
setting, and to direct future research on metric indexes.
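The role of the triangle inequality in pruning can be illustrated with a minimal sketch of pivot-based filtering for a range query. This is not taken from the survey: the function names (`build_pivot_table`, `range_query`), the choice of pivots, and the use of Euclidean distance are all assumptions made for illustration. Any distance function that satisfies the triangle inequality could be substituted.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
Dist = Callable[[Point, Point], float]


def build_pivot_table(data: Sequence[Point], pivots: Sequence[Point],
                      dist: Dist) -> List[List[float]]:
    """Precompute the distance from every object to every pivot."""
    return [[dist(obj, p) for p in pivots] for obj in data]


def range_query(query: Point, radius: float, data: Sequence[Point],
                pivots: Sequence[Point], table: List[List[float]],
                dist: Dist) -> Tuple[List[Point], int]:
    """Return objects within `radius` of `query` and the number of
    object distances that were actually computed after pruning."""
    q_to_pivots = [dist(query, p) for p in pivots]
    results: List[Point] = []
    computed = 0
    for obj, row in zip(data, table):
        # Triangle inequality: |d(q, p) - d(o, p)| <= d(q, o) for any pivot p,
        # so the largest such gap is a lower bound on d(q, o).
        lower = max(abs(qp - op) for qp, op in zip(q_to_pivots, row))
        if lower > radius:
            continue  # pruned without computing d(q, o)
        computed += 1  # validation: compute the real distance
        if dist(query, obj) <= radius:
            results.append(obj)
    return results, computed


# Hypothetical toy data, chosen only for illustration.
points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (9.0, 9.0), (2.0, 0.0)]
pivots = [(0.0, 0.0), (9.0, 9.0)]
table = build_pivot_table(points, pivots, math.dist)
hits, n_computed = range_query((1.0, 0.0), 1.5, points, pivots, table, math.dist)
```

The result is exact because an object is discarded only when its lower bound already exceeds the radius. The saving comes from the distance computations that pruning avoids.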
An Efficient Source Model Selection Framework in Model Databases
With the explosive increase of big data, training a Machine Learning (ML)
model becomes a computation-intensive workload, which would take days or even
weeks. Thus, reusing an already trained model has received attention, which is
called transfer learning. Transfer learning avoids training a new model from
scratch by transferring knowledge from a source task to a target task. Existing
transfer learning methods mostly focus on how to improve the performance of the
target task through a specific source model, and assume that the source model
is given. Although many source models are available, it is difficult for data
scientists to select the best source model for the target task manually. Hence,
how to efficiently select a suitable source model in a model database for model
reuse is an interesting but unsolved problem. In this paper, we propose SMS, an
effective, efficient, and flexible source model selection framework. SMS is
effective even when the source and target datasets have significantly different
data labels, flexible enough to support source models with any type of
structure, and efficient because it avoids any training process. For each source
model, SMS first vectorizes the samples in the target dataset into soft labels
by directly applying this model to the target dataset, then uses Gaussian
distributions to fit for clusters of soft labels, and finally measures the
distinguishing ability of the source model using a Gaussian mixture-based metric.
Moreover, we present an improved SMS (I-SMS), which reduces the number of
outputs of the source model. I-SMS can significantly reduce the selection time while
retaining the selection performance of SMS. Extensive experiments on a range of
practical model reuse workloads demonstrate the effectiveness and efficiency of
SMS.
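The three steps described above (soft labels, Gaussian fitting, a separability measure) can be sketched loosely as follows. This is not the SMS metric itself, which the abstract does not specify. The sketch assumes that clusters correspond to target labels, that each cluster is fitted with a diagonal Gaussian, and that the score is the fraction of samples assigned to their own cluster by the resulting mixture. The names `fit_class_gaussians` and `separability_score` are hypothetical.

```python
import numpy as np


def fit_class_gaussians(soft, labels, eps=1e-6):
    """Fit one diagonal Gaussian per target label to the soft labels.
    Returns {label: (mean, variance, prior)}."""
    params = {}
    for c in np.unique(labels):
        x = soft[labels == c]
        params[int(c)] = (x.mean(axis=0), x.var(axis=0) + eps, len(x) / len(soft))
    return params


def log_density(x, mean, var):
    """Log density of a diagonal Gaussian, evaluated row by row."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=1)


def separability_score(soft, labels):
    """Assumed stand-in score: fraction of samples whose own cluster has
    the highest prior-weighted Gaussian density."""
    soft = np.asarray(soft, dtype=float)
    labels = np.asarray(labels)
    params = fit_class_gaussians(soft, labels)
    classes = sorted(params)
    scores = np.stack(
        [np.log(params[c][2]) + log_density(soft, params[c][0], params[c][1])
         for c in classes],
        axis=1,
    )
    predicted = np.array(classes)[scores.argmax(axis=1)]
    return float((predicted == labels).mean())


# Hypothetical soft labels produced by two source models on four target samples.
target_labels = [0, 0, 1, 1]
soft_a = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]  # separates the labels
soft_b = [[0.5, 0.5], [0.9, 0.1], [0.9, 0.1], [0.5, 0.5]]  # does not
score_a = separability_score(soft_a, target_labels)
score_b = separability_score(soft_b, target_labels)
```

Under these assumptions, a source model whose outputs separate the target labels scores higher, and no training step is involved because the source model is only applied to the target data.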
SEA: A Scalable Entity Alignment System
Entity alignment (EA) aims to find equivalent entities in different knowledge
graphs (KGs). State-of-the-art EA approaches generally use Graph Neural
Networks (GNNs) to encode entities. However, most of them train the models and
evaluate the results in a full-batch fashion, which prevents EA from
scaling to large-scale datasets. To enhance the usability of GNN-based EA
models in real-world applications, we present SEA, a scalable entity alignment
system that can (i) train large-scale GNNs for EA, (ii) speed up the
normalization and the evaluation process, and (iii) report clear results that
let users evaluate different models and parameter settings. SEA can be run on a
computer with a single graphics card. Moreover, SEA encompasses six
state-of-the-art EA models and provides access for users to quickly establish
and evaluate their own models. Thus, SEA allows users to perform EA without
being involved in tedious implementations, such as negative sampling and
GPU-accelerated evaluation. With SEA, users can gain a clear view of the model
performance. In the demonstration, we show that SEA is user-friendly and
highly scalable even on computers with limited computational resources.
Comment: SIGIR'23 Demo Track
Optimal-Location-Selection Query Processing in Spatial Databases
This paper introduces and solves a novel type of spatial query, namely Optimal-Location-Selection (OLS) search, which has many real-life applications. Given a data object set DA, a target object set DB, a spatial region R, and a critical distance dc in a multidimensional space, an OLS query retrieves the target objects in DB that are outside R and have maximal optimality. Here, the optimality of a target object b ∈ DB located outside R is defined as the number of data objects from DA that are inside R and whose distances to b do not exceed dc. When there is a tie, the accumulated distance from the data objects to b serves as the tie breaker, and the object with the smaller distance has the better optimality. In this paper, we present the optimality metric, formalize the OLS query, and propose several algorithms for processing OLS queries efficiently. A comprehensive experimental evaluation using both real and synthetic data sets demonstrates the efficiency and effectiveness of the proposed algorithms. Index Terms: Query processing, optimal-location-selection, spatial database, algorithm.
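The OLS definition above can be made concrete with a brute-force sketch. It is a plain reading of the definition and not one of the paper's algorithms. It assumes two-dimensional points, an axis-aligned closed rectangle for R, Euclidean distance, and a tie breaker accumulated over the counted data objects only. The function name `ols_brute_force` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), assumed closed


def inside(p: Point, r: Rect) -> bool:
    """True if point p lies in rectangle r (boundary included)."""
    return r[0] <= p[0] <= r[2] and r[1] <= p[1] <= r[3]


def ols_brute_force(data_a: Sequence[Point], targets_b: Sequence[Point],
                    region: Rect, dc: float) -> List[Point]:
    """Return the target objects outside `region` with maximal optimality."""
    in_region = [a for a in data_a if inside(a, region)]
    best_key = None
    best: List[Point] = []
    for b in targets_b:
        if inside(b, region):
            continue  # only targets outside R qualify
        near = [d for d in (math.dist(a, b) for a in in_region) if d <= dc]
        # Higher count is better; on a tie, smaller accumulated distance wins.
        key = (-len(near), sum(near))
        if best_key is None or key < best_key:
            best_key, best = key, [b]
        elif key == best_key:
            best.append(b)
    return best


# Hypothetical toy instance, chosen only for illustration.
data_a = [(1.0, 1.0), (2.0, 2.0), (8.0, 8.0)]
targets_b = [(4.0, 2.0), (2.0, 5.0), (1.0, 2.0)]
region = (0.0, 0.0, 3.0, 3.0)
answer = ols_brute_force(data_a, targets_b, region, dc=3.0)
```

In this toy instance two targets each cover one data object, so the tie breaker decides. The scan costs one distance computation per pair of in-region data object and outside target, which is the cost that more efficient algorithms aim to reduce.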